Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and its Automatic Evaluation
Authors
Abstract
Acquiring noun phrases from running text is useful for many applications, such as word grouping and terminology indexing. Previous work adopts either a purely probabilistic approach or a purely rule-based noun phrase grammar to tackle this problem. In this paper, we apply a probabilistic chunker to decide the implicit boundaries of constituents and use linguistic knowledge to extract the noun phrases with a finite-state mechanism. The test texts are drawn from the SUSANNE Corpus, and the results are evaluated automatically by comparison with the parse field of the SUSANNE Corpus. The results of this preliminary experiment are encouraging.
The partial parser is motivated by an intuition (Abney, 1991): (1) When we read a sentence, we read it chunk by chunk. Abney uses two-level grammar rules to implement the parser with a pure LR parsing technique. The first-level grammar rules handle the chunking process; the second-level grammar rules handle the attachment problems among chunks. For historical reasons, our statistics-based partial parser is called a chunker. The chunker receives tagged text and outputs a linear sequence of chunks. We assign a syntactic head and a semantic head to each chunk. Then we extract the plausible maximal noun phrases according to the syntactic and semantic head information, using a finite-state mechanism with only 8 states.
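To make the idea of finite-state noun phrase extraction and automatic span-based evaluation concrete, here is a minimal sketch in Python. It is not the paper's 8-state mechanism, and it does not use syntactic or semantic heads: the three states, the coarse tags (DET, ADJ, NOUN), the pattern DET? ADJ* NOUN+, and the function names are all assumptions chosen for illustration. The evaluation function assumes exact span matching against a gold list, which may differ from how the comparison with the SUSANNE parse field is actually done.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]   # (word, coarse part-of-speech tag)
Span = Tuple[int, int]    # half-open token interval [start, end)

# Hypothetical toy acceptor for DET? ADJ* NOUN+ (NOT the paper's 8-state machine).
# START: nothing read; PRE: determiner/adjectives read; HEAD: at least one noun read.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("START", "DET"): "PRE",
    ("START", "ADJ"): "PRE",
    ("START", "NOUN"): "HEAD",
    ("PRE", "ADJ"): "PRE",
    ("PRE", "NOUN"): "HEAD",
    ("HEAD", "NOUN"): "HEAD",
}
ACCEPTING = {"HEAD"}


def extract_noun_phrases(tokens: List[Token]) -> List[Span]:
    """Return the longest accepted span starting at each position, left to right."""
    spans: List[Span] = []
    i, n = 0, len(tokens)
    while i < n:
        state, j, last_accept = "START", i, None
        while j < n:
            nxt = TRANSITIONS.get((state, tokens[j][1]))
            if nxt is None:
                break  # no transition on this tag: stop scanning
            state = nxt
            j += 1
            if state in ACCEPTING:
                last_accept = j  # remember the longest accepted prefix
        if last_accept is not None:
            spans.append((i, last_accept))
            i = last_accept      # continue after the extracted phrase
        else:
            i += 1               # no phrase starts here
    return spans


def precision_recall(predicted: List[Span], gold: List[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted spans against gold spans."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


if __name__ == "__main__":
    sentence = [("the", "DET"), ("partial", "ADJ"), ("parser", "NOUN"),
                ("receives", "VERB"), ("tagged", "ADJ"), ("texts", "NOUN")]
    found = extract_noun_phrases(sentence)
    print([" ".join(w for w, _ in sentence[s:e]) for s, e in found])
    print(precision_recall(found, [(0, 3), (4, 6)]))
```

A table-driven acceptor keeps the linguistic knowledge in one place (the transition table), so a richer machine could be substituted without changing the scanning loop.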
Similar resources
Automatic titling of Articles Using Position and Statistical Information
This paper describes a system facilitating information retrieval in a set of textual documents by tackling the automatic titling and subtitling issue. Automatic titling here consists in extracting relevant noun phrases from texts as candidate titles. An original approach combining statistical criteria and noun phrases positions in the text helps collecting relevant titles and subtitles. So, the...
Recherche documentaire par titrage automatique
In this paper, we propose a system in order to facilitate the information retrieval in a set of textual documents. Our approach is based on the automatic titling (and subtitling). This last one is crucial, for example, for the issue of web pages accessibility (W3C standard). Our process of automatic titling consists in extracting relevant noun phrases from texts. These ones can represent a titl...
Semi-Automatic Recognition of Noun Modifier Relationships
Semantic relationships among words and phrases are often marked by explicit syntactic or lexical clues that help recognize such relationships in texts. Within complex nominals, however, few overt clues are available. Systems that analyze such nominals must compensate for the lack of surface clues with other information. One way is to load the system with lexical semantics for nouns or adjective...
Surface Grammatical Analysis For The Extraction Of Terminological Noun Phrases
LEXTER is a software package for extracting terminology. A corpus of French language texts on any subject field is fed in, and LEXTER produces a list of likely terminological units to be submitted to an expert to be validated. To identify the terminological units, LEXTER takes their form into account and proceeds in two main stages : analysis, parsing. In the first stage, LEXTER uses a base of ...
Terminology extraction from medical texts in Polish
BACKGROUND Hospital documents contain free text describing the most important facts relating to patients and their illnesses. These documents are written in specific language containing medical terminology related to hospital treatment. Their automatic processing can help in verifying the consistency of hospital documentation and obtaining statistical data. To perform this task we need informat...